Protein names precisely peeled off free text

Authors

  • Sven Mika
  • Burkhard Rost
Abstract

MOTIVATION Automatically identifying protein names in the scientific literature is a prerequisite for meeting the growing demand to data-mine this wealth of information. Existing approaches are based on dictionaries, rules and machine learning. Here, we introduce a novel system that combines a dictionary- and rule-based pre-processing filter with several separately trained support vector machines (SVMs) to identify protein names in MEDLINE abstracts.

RESULTS Our new tagging system, NLProt, extracts protein names with a precision (accuracy) of 75% at a recall (coverage) of 76% after training on a corpus of 200 annotated abstracts that other groups have used before. For our estimate of sustained performance, we counted partially identified names as false positives. One important issue frequently ignored in the literature is redundancy in evaluation sets. We suggest some guidelines for removing inappropriate overlaps between training and testing sets. When these guidelines are applied, our program appears to significantly outperform other methods for tagging protein names. NLProt owes its success to its SVM building blocks, which exploit the local context of protein names in the scientific literature. We contend that our system may constitute the most general and precise method for tagging protein names.

AVAILABILITY http://cubic.bioc.columbia.edu/services/nlprot/
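The scoring convention mentioned in the abstract, where a partially identified name counts as a false positive, can be illustrated with a minimal sketch. This is not NLProt's evaluation code: the function name `strict_precision_recall`, the representation of names as `(start, end)` character spans, and the sample spans are all assumptions made for illustration.

```python
def strict_precision_recall(predicted, gold):
    """Score predicted name spans against gold spans with exact matching.

    Assumption (hypothetical representation): each name is a (start, end)
    character span. A prediction that only partially overlaps a gold name
    is treated as a false positive, and the gold name it missed counts
    as a false negative.
    """
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)  # spans whose boundaries match exactly
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Made-up spans for illustration only.
gold = {(0, 4), (10, 22), (30, 35)}
predicted = {(0, 4), (10, 18), (30, 35), (40, 44)}  # (10, 18) is a partial hit

p, r = strict_precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Under this strict convention the partial hit `(10, 18)` lowers both figures: it adds a false positive and leaves the gold span `(10, 22)` unmatched.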

Similar resources

Microbial quality of shrimp products of export trade produced from aquacultured shrimp.

Bacteriological quality of individually quick frozen (IQF) shrimp products produced from aquacultured tiger shrimp (Penaeus monodon) has been analysed in terms of aerobic plate count (APC), coliforms, Escherichia coli, coagulase-positive staphylococci, Salmonella, and Listeria monocytogenes. Eight hundred forty-six samples of raw, peeled, and deveined tail-on (RPTO), 928 samples of cooked, peel...

Dissecting force interactions in cellulose deconstruction reveals the required solvent versatility for overcoming biomass recalcitrance.

Pretreatment for deconstructing the multifaceted interaction network in crystalline cellulose is a limiting step in making fuels from lignocellulosic biomass. Not soluble in water and most organic solvents, cellulose was found to dissolve in certain classes of ionic liquids (ILs). To elucidate the underlying mechanisms, we simulated cellulose deconstruction by peeling off an 11-residue glucan c...

Effect of turmeric on shrimp (Penaeus semisulcatus) shelf life extension in chilled storage conditions

The present investigation aimed to evaluate the effect of turmeric on shelf life extension of shrimp Penaeus semisulcatus under chilled storage conditions by sensory (organoleptic parameters), pH, proximate and bacterial analysis. The experimental setup was grouped into six, head on (group I), head on coated with turmeric (group II), headless (group III), headless coated with turmeric (group IV...

Flexible metal film with micro- and nanopatterns transferred by electrochemical deposition

A new method for patterning microstructure on metal film by electrochemical deposition is provided. The metal film with micropattern can be peeled off after the deposition by inter-medium layer of resistant molecules such as triglyceride. We can use the technology of electrochemical deposition to make the metal film possess different functions such as soft and bendable properties. We also give ...

Biological Text Mining for Extraction of Proteins and Their Interactions

Text mining techniques have been proposed for extracting protein names and their interactions. First, we have made improvements on existing methods for handling single word protein names consisting of characters, special symbols, and numbers. Second, compound word protein names are extracted using conditional probabilities of the occurrences of neighboring words. Third, interactions are extract...

Journal:
  • Bioinformatics

Volume 20 Suppl 1  Issue 

Pages  -

Publication date 2004